In an ever-volatile market where every dollar counts, the used car market represents a critical sector of the consumer industry. With rising consumer demand and an increasing variety of vehicles entering the secondary market, understanding the factors that influence used car prices is essential. Buyers seek to make informed decisions based on value for money, while sellers aim to maximize returns by accurately pricing their vehicles. Bridging this gap requires a data-driven approach to uncover the relationships between vehicle specifications and market pricing.
This project utilizes the “Vehicle dataset” by Nehal Birla, Nishant Verma, and Nikhil Kushwaha, available on Kaggle
This dataset aggregates detailed information on over 10,000 used cars, including key variables such as fuel type, transmission, engine capacity, mileage, and kilometers driven, alongside categorical variables like seller type, ownership history, and geographic location.
Our analysis is driven by three core objectives:
Using statistical modeling techniques, including regression analysis, we aim to deliver a robust and interpretable model that not only predicts car prices but also highlights the most influential factors driving price variations. Our results will shed light on market dynamics, offering actionable insights for both consumers and industry professionals navigating this volatile space.
By the conclusion of this project, we aim to provide a detailed analysis that enhances understanding of the used car market, aiding stakeholders in making informed decisions in an ever-changing economic landscape.
The dataset contains the following attributes
| Attributes | Description | |
|---|---|---|
| 1 | Name | The brand name of the vehicle (e.g., Hyundai i10, Honda City). |
| 2 | Year | The year the vehicle was manufactured. |
| 3 | Selling Price | The selling price of the vehicle in INR. |
| 4 | Km Driven | The total distance the vehicle has been driven. |
| 5 | Fuel | The type of fuel used by the vehicle (e.g Petrol, Diesel) |
| 6 | Seller Type | The type of seller (e.g Individual, Dealer). |
| 7 | Transmission | The type of transmission system (e.g Manual, Automatic). |
| 8 | Owner | The number of previous owners of the vehicle (e.g., First Owner, Second Owner). |
| 9 | Mileage | The fuel efficiency of the vehicle.(in km/l or km/kg) |
| 10 | Engine | The engine displacement capacity.(in CC) |
| 11 | Max Power | The maximum power output of the vehicle’s engine measured in horsepower.(in HP). |
| 12 | Torque | The maximum torque. |
| 13 | Seats | The seating capacity of the vehicle (e.g 5-seater, 7-seater). |
The above attributes can be categorized into Numeric and Categorical variables, with selling price as continuous numeric response variable.
| Attribute Type | Attribute |
|---|---|
| Categorical Variables | Name, Fuel, Seller Type, Transmission, Owner |
| Discrete Numeric Variables | Year, Km Driven, Seats |
| Continuous Numeric Variables | Selling Price, Mileage, Engine, Max Power, Torque |
car_details = read_csv("Car details v3.csv")
nrow(car_details)
## [1] 8128
car_details = na.omit(car_details)
nrow(car_details)
## [1] 7906
sum(duplicated(car_details))
## [1] 1189
car_details = car_details[!duplicated(car_details),]
nrow(car_details)
## [1] 6717
head(car_details)
## # A tibble: 6 × 13
## name year selling_price km_driven fuel seller_type transmission owner
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Maruti Swi… 2014 450000 145500 Dies… Individual Manual Firs…
## 2 Skoda Rapi… 2014 370000 120000 Dies… Individual Manual Seco…
## 3 Honda City… 2006 158000 140000 Petr… Individual Manual Thir…
## 4 Hyundai i2… 2010 225000 127000 Dies… Individual Manual Firs…
## 5 Maruti Swi… 2007 130000 120000 Petr… Individual Manual Firs…
## 6 Hyundai Xc… 2017 440000 45000 Petr… Individual Manual Firs…
## # ℹ 5 more variables: mileage <chr>, engine <chr>, max_power <chr>,
## # torque <chr>, seats <dbl>
# Extract the first word from "name" - to get the "make" of the vehicle
car_details$name = word(car_details$name,1)
# Rename name to make
colnames(car_details)[1] = "make"
# Change datatypes of "make" from character to factor
car_details$make = as.factor(car_details$make)
# Extract the numeric value from mileage
car_details$mileage = word(car_details$mileage,1)
#Change data_type of "mileage" from character to numeric
car_details$mileage=as.numeric(car_details$mileage)
# Extract the numeric value from engine
car_details$engine = word(car_details$engine,1)
#Change data_type of "engine" from character to numeric
car_details$engine=as.numeric(car_details$engine)
# Extract the numeric value from max_power
car_details$max_power = word(car_details$max_power,1)
#Change data_type of "max_power" from character to numeric
car_details$max_power=as.numeric(car_details$max_power)
# Change datatypes of "fuel", "seller_type", "transmission","owner"
# From character to factor
car_details$fuel=as.factor(car_details$fuel)
car_details$seller_type=as.factor(car_details$seller_type)
car_details$transmission=as.factor(car_details$transmission)
car_details$owner=as.factor(car_details$owner)
#colnames(car_details)[7]="AM"
##KM driven in ten thousands
car_details$km_driven = car_details$km_driven/10000
#Rename km_driven
colnames(car_details)[4]="km_driven_in_10k"
#Dropping column torque
car_details = car_details[,!names(car_details) %in% "torque"]
#Selling price in ten thousands
car_details$selling_price = car_details$selling_price/10000
#Rename Selling price
colnames(car_details)[3]="selling_price_in_10k"
head(car_details)
## # A tibble: 6 × 12
## make year selling_price_in_10k km_driven_in_10k fuel seller_type
## <fct> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Maruti 2014 45 14.6 Diesel Individual
## 2 Skoda 2014 37 12 Diesel Individual
## 3 Honda 2006 15.8 14 Petrol Individual
## 4 Hyundai 2010 22.5 12.7 Diesel Individual
## 5 Maruti 2007 13 12 Petrol Individual
## 6 Hyundai 2017 44 4.5 Petrol Individual
## # ℹ 6 more variables: transmission <fct>, owner <fct>, mileage <dbl>,
## # engine <dbl>, max_power <dbl>, seats <dbl>
unique(car_details$make)
## [1] Maruti Skoda Honda Hyundai Toyota
## [6] Ford Renault Mahindra Tata Chevrolet
## [11] Datsun Jeep Mercedes-Benz Mitsubishi Audi
## [16] Volkswagen BMW Nissan Lexus Jaguar
## [21] Land MG Volvo Daewoo Kia
## [26] Fiat Force Ambassador Ashok Isuzu
## [31] Opel
## 31 Levels: Ambassador Ashok Audi BMW Chevrolet Daewoo Datsun Fiat ... Volvo
car_details$make_category = ifelse(car_details$make %in%
c("Maruti", "Tata", "Mahindra", "Datsun",
"Renault", "Chevrolet", "Fiat",
"Daewoo", "Ambassador", "Ashok"),
"Budget",
ifelse(car_details$make %in%
c("Honda", "Ford", "Hyundai","Toyota", "Volkswagen","Nissan",
"Skoda", "Mitsubishi","Force", "Kia","MG"),
"Midrange",
"Luxury"))
car_details$make_category = as.factor(car_details$make_category)
levels(car_details$make_category)
## [1] "Budget" "Luxury" "Midrange"
structure_car_details = data.frame(
Column = names(car_details),
Type = unname(sapply(car_details, class)))
print(structure_car_details)
## Column Type
## 1 make factor
## 2 year numeric
## 3 selling_price_in_10k numeric
## 4 km_driven_in_10k numeric
## 5 fuel factor
## 6 seller_type factor
## 7 transmission factor
## 8 owner factor
## 9 mileage numeric
## 10 engine numeric
## 11 max_power numeric
## 12 seats numeric
## 13 make_category factor
car_details$selling_price_in_10k[car_details$selling_price_in_10k > 720]
## [1] 1000
car_details=subset(car_details,subset = car_details$selling_price_in_10k < 720,)
hist(car_details$selling_price_in_10k,xlab="Selling Price (in Ten Thousands)",
main="Selling Price distribution",
col = "lightblue")
make_counts = table(car_details$make)
fuel_type_car_count = table(car_details$fuel)
seller_type_car_count = table(car_details$seller_type)
trans_type_car_count = table(car_details$transmission)
owner_type_car_count = table(car_details$owner)
par(mfrow = c(2, 3))
# Plot 1: Number of cars by make
barplot(sort(make_counts),horiz=TRUE, las = 1,
xlab = "Number of Cars",
ylab = "Car Make",
col = "lightblue",
cex.names = 0.5,
main = "Num. of Cars in Each Make")
# Plot 2: Number of cars by fuel type
barplot(sort(fuel_type_car_count), horiz = TRUE, las = 1, cex.names = 0.9,
col = "lightblue", main = "Num. of Cars in Each Fuel Type")
# Plot 3: Number of cars by seller type
barplot(sort(seller_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
col = "lightblue", main = "Num. of Cars in Each Seller Type")
# Plot 4: Number of cars by trans type
barplot(sort(trans_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
col = "lightblue", main = "Num. of Cars in Each Trans Type")
# Plot 5: Number of cars by owner type
barplot(sort(owner_type_car_count), horiz = TRUE, las = 1, cex.names = 0.6,
col = "lightblue", main = "Num. of Cars in Each Owner Type")
boxplot(selling_price_in_10k ~ make, data = car_details,col=rainbow(length(unique(car_details$make))),
las = 2,cex.axis = 0.7,
main = "Car Make Vs Selling Price",
xlab = "Car Make",
ylab = "Selling Price (in 10k)")
par(mfrow = c(1, 2))
boxplot(selling_price_in_10k ~ fuel, data=car_details,cex.axis = 0.8,
col=rainbow(length(unique(car_details$fuel))),
main="SellingPrice based on fuel type")
plot(selling_price_in_10k ~ seller_type, data=car_details,cex.axis = 0.6,
col=rainbow(length(unique(car_details$seller_type))),
main="SellingPrice based on seller type")
par(mfrow = c(1, 2))
plot(selling_price_in_10k ~ transmission, data=car_details,
col=rainbow(length(unique(car_details$transmission))),
main="SellingPrice based on trans. type")
plot(selling_price_in_10k ~ owner, data=car_details,las=2,cex.axis = 0.5,
col=rainbow(length(unique(car_details$owner))),
main="SellingPrice based on owner type")
colours = ifelse(car_details$transmission == "Automatic", "blue", "red")
plot(selling_price_in_10k ~ year,data=car_details,col=colours,pch=19,
main = "Selling Price based on Year (with Transmission Type)",
xlab = "Year",
ylab = "Selling Price (in Ten Thousands)")
legend("topleft", legend = c("Automatic", "Manual"),
col = c("blue", "red"), pch = 19)
colours = c("Budget" = "blue", "Midrange"="green","Luxury"= "orange")
plot(selling_price_in_10k ~ year,data=car_details,
col = colours[car_details$make_category],pch=19,
main = "Selling Price based on Year (with make_category)",
xlab = "Year",
ylab = "Selling Price (in Ten Thousands)")
legend("topleft", legend = c("Budget","Midrange","Luxury"),
col = c("blue", "green","orange"), pch = 19)
par(mfrow = c(1, 2))
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
xlab="KM_Driven((in Ten Thousand)",
ylab="Selling_Price(in Ten Thousand)",
main="Price based on km_driven")
car_details = subset(car_details,car_details$km_driven_in_10k < 100,)
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
xlab="KM_Driven((in Ten Thousand)",
ylab="Selling_Price(in Ten Thousand)",
main="After removing extreme values")
plot(selling_price_in_10k ~ mileage, data=car_details,
xlab="Mileage",
ylab="Selling_Price(in Ten Thousand)",
main="Price based on Mileage")
pairs(selling_price_in_10k ~ year+km_driven_in_10k+mileage+engine+max_power+seats ,data=car_details)
cor_mat = cor(car_details[,sapply(car_details,is.numeric)])
cor_mat
## year selling_price_in_10k km_driven_in_10k
## year 1.00000000 0.4383902 -0.45453560
## selling_price_in_10k 0.43839024 1.0000000 -0.19714853
## km_driven_in_10k -0.45453560 -0.1971485 1.00000000
## mileage 0.36669288 -0.1257948 -0.23116300
## engine -0.02073416 0.4510495 0.30389598
## max_power 0.15893720 0.6872307 0.04747428
## seats 0.02483339 0.1656998 0.24746651
## mileage engine max_power seats
## year 0.3666929 -0.02073416 0.15893720 0.02483339
## selling_price_in_10k -0.1257948 0.45104953 0.68723067 0.16569977
## km_driven_in_10k -0.2311630 0.30389598 0.04747428 0.24746651
## mileage 1.0000000 -0.58120564 -0.39006970 -0.45881198
## engine -0.5812056 1.00000000 0.68630272 0.65884340
## max_power -0.3900697 0.68630272 1.00000000 0.26214681
## seats -0.4588120 0.65884340 0.26214681 1.00000000
high_cor = cor_mat[cor_mat > 0.5 & cor_mat != 1]
high_cor_indices = which(cor_mat > 0.5 & cor_mat != 1, arr.ind = TRUE)
high_cor_df = data.frame(
row = rownames(cor_mat)[high_cor_indices[, 1]],
column = colnames(cor_mat)[high_cor_indices[, 2]],
correlation = high_cor)
head(high_cor_df, 3)
## row column correlation
## 1 max_power selling_price_in_10k 0.6872307
## 2 max_power engine 0.6863027
## 3 seats engine 0.6588434
set.seed(125)
train_indices = sample(nrow(car_details), size = 0.80 * nrow(car_details))
train_data = car_details[train_indices, ]
test_data = car_details[-train_indices, ]
nrow(train_data)
## [1] 5370
nrow(test_data)
## [1] 1343
full_model = lm(selling_price_in_10k ~ .-make, data = train_data)
summary(full_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ . - make, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -206.057 -10.898 -1.291 8.093 313.265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.227e+03 2.707e+02 -26.697 < 2e-16 ***
## year 3.591e+00 1.354e-01 26.518 < 2e-16 ***
## km_driven_in_10k -9.175e-01 9.513e-02 -9.645 < 2e-16 ***
## fuelDiesel 8.676e+00 4.266e+00 2.034 0.04204 *
## fuelLPG 1.168e+01 6.595e+00 1.771 0.07666 .
## fuelPetrol 6.844e-01 4.289e+00 0.160 0.87322
## seller_typeIndividual -8.027e+00 1.283e+00 -6.257 4.23e-10 ***
## seller_typeTrustmark Dealer -4.045e+00 5.454e+00 -0.742 0.45833
## transmissionManual -1.569e+01 1.577e+00 -9.947 < 2e-16 ***
## ownerFourth & Above Owner -2.610e+00 2.483e+00 -1.051 0.29315
## ownerSecond Owner -5.009e+00 8.999e-01 -5.566 2.73e-08 ***
## ownerTest Drive Car 2.425e+02 1.199e+01 20.222 < 2e-16 ***
## ownerThird Owner -4.364e+00 1.542e+00 -2.831 0.00466 **
## mileage 5.343e-02 1.598e-01 0.334 0.73805
## engine 1.089e-02 1.755e-03 6.206 5.84e-10 ***
## max_power 5.040e-01 2.053e-02 24.546 < 2e-16 ***
## seats 1.646e+00 5.983e-01 2.750 0.00598 **
## make_categoryLuxury 1.101e+02 3.120e+00 35.299 < 2e-16 ***
## make_categoryMidrange 2.385e+00 8.360e-01 2.853 0.00435 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.51 on 5351 degrees of freedom
## Multiple R-squared: 0.7247, Adjusted R-squared: 0.7238
## F-statistic: 782.6 on 18 and 5351 DF, p-value: < 2.2e-16
par(mfrow = c(1, 2))
plot(full_model,which=c(1,2))
vif(full_model)
## GVIF Df GVIF^(1/(2*Df))
## year 2.126959 1 1.458410
## km_driven_in_10k 1.639337 1 1.280366
## fuel 2.494514 3 1.164567
## seller_type 1.130152 2 1.031061
## transmission 1.447493 1 1.203118
## owner 1.401437 4 1.043090
## mileage 3.166319 1 1.779415
## engine 5.599938 1 2.366419
## max_power 3.094793 1 1.759202
## seats 2.537822 1 1.593054
## make_category 1.962517 2 1.183596
n= nrow(train_data)
bic_full_model = step(full_model,direction="both",k=log(n),trace=0)
summary(bic_full_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ year + km_driven_in_10k +
## fuel + seller_type + transmission + owner + engine + max_power +
## make_category, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -206.583 -10.943 -1.313 8.151 313.568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.370e+03 2.431e+02 -30.316 < 2e-16 ***
## year 3.665e+00 1.204e-01 30.440 < 2e-16 ***
## km_driven_in_10k -8.999e-01 9.496e-02 -9.476 < 2e-16 ***
## fuelDiesel 9.099e+00 4.263e+00 2.134 0.0329 *
## fuelLPG 1.224e+01 6.564e+00 1.864 0.0623 .
## fuelPetrol 1.136e+00 4.257e+00 0.267 0.7895
## seller_typeIndividual -7.968e+00 1.283e+00 -6.211 5.66e-10 ***
## seller_typeTrustmark Dealer -4.302e+00 5.456e+00 -0.788 0.4305
## transmissionManual -1.537e+01 1.572e+00 -9.778 < 2e-16 ***
## ownerFourth & Above Owner -2.449e+00 2.483e+00 -0.986 0.3240
## ownerSecond Owner -5.020e+00 9.003e-01 -5.576 2.59e-08 ***
## ownerTest Drive Car 2.426e+02 1.200e+01 20.226 < 2e-16 ***
## ownerThird Owner -4.405e+00 1.542e+00 -2.856 0.0043 **
## engine 1.345e-02 1.198e-03 11.226 < 2e-16 ***
## max_power 4.937e-01 2.013e-02 24.530 < 2e-16 ***
## make_categoryLuxury 1.083e+02 3.051e+00 35.492 < 2e-16 ***
## make_categoryMidrange 1.661e+00 7.945e-01 2.091 0.0366 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.52 on 5353 degrees of freedom
## Multiple R-squared: 0.7243, Adjusted R-squared: 0.7235
## F-statistic: 879 on 16 and 5353 DF, p-value: < 2.2e-16
plot(bic_full_model,which=c(1,2))
vif(bic_full_model)
## GVIF Df GVIF^(1/(2*Df))
## year 1.679628 1 1.296005
## km_driven_in_10k 1.632010 1 1.277502
## fuel 1.635248 3 1.085419
## seller_type 1.128210 2 1.030617
## transmission 1.436714 1 1.198630
## owner 1.399679 4 1.042926
## engine 2.609471 1 1.615386
## max_power 2.970440 1 1.723497
## make_category 1.741865 2 1.148824
small_model = lm(selling_price_in_10k ~ make_category*year*max_power + year*transmission + owner + seller_type + km_driven_in_10k+mileage, data = train_data)
summary(small_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ make_category * year * max_power +
## year * transmission + owner + seller_type + km_driven_in_10k +
## mileage, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.924 -9.112 -1.273 7.930 215.795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.745e+03 8.799e+02 3.120 0.00182 **
## make_categoryLuxury 8.824e+04 6.964e+03 12.671 < 2e-16 ***
## make_categoryMidrange 8.867e+03 1.084e+03 8.179 3.55e-16 ***
## year -1.356e+00 4.369e-01 -3.105 0.00191 **
## max_power -1.664e+02 7.047e+00 -23.613 < 2e-16 ***
## transmissionManual 1.795e+03 6.995e+02 2.566 0.01033 *
## ownerFourth & Above Owner -1.338e+00 1.850e+00 -0.723 0.46968
## ownerSecond Owner -2.834e+00 6.712e-01 -4.222 2.46e-05 ***
## ownerTest Drive Car 1.651e+02 9.013e+00 18.323 < 2e-16 ***
## ownerThird Owner -2.958e+00 1.149e+00 -2.575 0.01005 *
## seller_typeIndividual -2.473e+00 9.598e-01 -2.577 0.01000 *
## seller_typeTrustmark Dealer -4.393e+00 4.055e+00 -1.083 0.27871
## km_driven_in_10k 1.602e-01 6.538e-02 2.450 0.01430 *
## mileage -3.380e-01 8.386e-02 -4.031 5.63e-05 ***
## make_categoryLuxury:year -4.386e+01 3.456e+00 -12.689 < 2e-16 ***
## make_categoryMidrange:year -4.417e+00 5.385e-01 -8.203 2.90e-16 ***
## make_categoryLuxury:max_power -6.323e+02 3.692e+01 -17.126 < 2e-16 ***
## make_categoryMidrange:max_power -8.920e+01 1.201e+01 -7.425 1.30e-13 ***
## year:max_power 8.290e-02 3.500e-03 23.683 < 2e-16 ***
## year:transmissionManual -8.944e-01 3.470e-01 -2.577 0.00999 **
## make_categoryLuxury:year:max_power 3.145e-01 1.833e-02 17.159 < 2e-16 ***
## make_categoryMidrange:year:max_power 4.446e-02 5.966e-03 7.452 1.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.73 on 5348 degrees of freedom
## Multiple R-squared: 0.8476, Adjusted R-squared: 0.847
## F-statistic: 1416 on 21 and 5348 DF, p-value: < 2.2e-16
plot(small_model,which=c(1,2))
vif(small_model,type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df))
## make_category 52.058575 11 1.196802
## year 2.774981 13 1.040036
## max_power 52.058575 11 1.196802
## transmission 28.730113 3 1.750073
## owner 1.441031 4 1.046729
## seller_type 1.139833 2 1.033262
## km_driven_in_10k 1.398158 1 1.182437
## mileage 1.574993 1 1.254987
## Interacts With
## make_category year, max_power
## year make_category, max_power, transmission
## max_power make_category, year
## transmission year
## owner --
## seller_type --
## km_driven_in_10k --
## mileage --
## Other Predictors
## make_category transmission, owner, seller_type, km_driven_in_10k, mileage
## year owner, seller_type, km_driven_in_10k, mileage
## max_power transmission, owner, seller_type, km_driven_in_10k, mileage
## transmission make_category, max_power, owner, seller_type, km_driven_in_10k, mileage
## owner make_category, year, max_power, transmission, seller_type, km_driven_in_10k, mileage
## seller_type make_category, year, max_power, transmission, owner, km_driven_in_10k, mileage
## km_driven_in_10k make_category, year, max_power, transmission, owner, seller_type, mileage
## mileage make_category, year, max_power, transmission, owner, seller_type, km_driven_in_10k
n= nrow(train_data)
bic_small_model = step(small_model,direction="both",k=log(n),trace=0)
summary(bic_small_model)
##
## Call:
## lm(formula = selling_price_in_10k ~ make_category + year + max_power +
## transmission + owner + mileage + make_category:year + make_category:max_power +
## year:max_power + make_category:year:max_power, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.318 -9.151 -1.366 7.847 213.414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.597e+03 5.197e+02 8.846 < 2e-16 ***
## make_categoryLuxury 8.807e+04 6.960e+03 12.654 < 2e-16 ***
## make_categoryMidrange 9.576e+03 1.069e+03 8.961 < 2e-16 ***
## year -2.276e+00 2.584e-01 -8.809 < 2e-16 ***
## max_power -1.660e+02 6.965e+00 -23.834 < 2e-16 ***
## transmissionManual -8.056e+00 1.174e+00 -6.859 7.72e-12 ***
## ownerFourth & Above Owner -1.359e+00 1.849e+00 -0.735 0.46257
## ownerSecond Owner -2.828e+00 6.664e-01 -4.244 2.23e-05 ***
## ownerTest Drive Car 1.668e+02 9.009e+00 18.518 < 2e-16 ***
## ownerThird Owner -2.965e+00 1.145e+00 -2.590 0.00963 **
## mileage -3.298e-01 8.383e-02 -3.934 8.47e-05 ***
## make_categoryLuxury:year -4.377e+01 3.455e+00 -12.671 < 2e-16 ***
## make_categoryMidrange:year -4.769e+00 5.308e-01 -8.985 < 2e-16 ***
## make_categoryLuxury:max_power -6.406e+02 3.692e+01 -17.349 < 2e-16 ***
## make_categoryMidrange:max_power -9.691e+01 1.184e+01 -8.182 3.47e-16 ***
## year:max_power 8.271e-02 3.460e-03 23.905 < 2e-16 ***
## make_categoryLuxury:year:max_power 3.186e-01 1.833e-02 17.383 < 2e-16 ***
## make_categoryMidrange:year:max_power 4.829e-02 5.883e-03 8.209 2.77e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.76 on 5352 degrees of freedom
## Multiple R-squared: 0.8471, Adjusted R-squared: 0.8466
## F-statistic: 1744 on 17 and 5352 DF, p-value: < 2.2e-16
plot(bic_small_model,which=c(1,2))
vif(small_model,type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df))
## make_category 52.058575 11 1.196802
## year 2.774981 13 1.040036
## max_power 52.058575 11 1.196802
## transmission 28.730113 3 1.750073
## owner 1.441031 4 1.046729
## seller_type 1.139833 2 1.033262
## km_driven_in_10k 1.398158 1 1.182437
## mileage 1.574993 1 1.254987
## Interacts With
## make_category year, max_power
## year make_category, max_power, transmission
## max_power make_category, year
## transmission year
## owner --
## seller_type --
## km_driven_in_10k --
## mileage --
## Other Predictors
## make_category transmission, owner, seller_type, km_driven_in_10k, mileage
## year owner, seller_type, km_driven_in_10k, mileage
## max_power transmission, owner, seller_type, km_driven_in_10k, mileage
## transmission make_category, max_power, owner, seller_type, km_driven_in_10k, mileage
## owner make_category, year, max_power, transmission, seller_type, km_driven_in_10k, mileage
## seller_type make_category, year, max_power, transmission, owner, km_driven_in_10k, mileage
## km_driven_in_10k make_category, year, max_power, transmission, owner, seller_type, mileage
## mileage make_category, year, max_power, transmission, owner, seller_type, km_driven_in_10k
bptest(bic_small_model)
##
## studentized Breusch-Pagan test
##
## data: bic_small_model
## BP = 1694.2, df = 17, p-value < 2.2e-16
anova(bic_small_model,bic_full_model)
## Analysis of Variance Table
##
## Model 1: selling_price_in_10k ~ make_category + year + max_power + transmission +
## owner + mileage + make_category:year + make_category:max_power +
## year:max_power + make_category:year:max_power
## Model 2: selling_price_in_10k ~ year + km_driven_in_10k + fuel + seller_type +
## transmission + owner + engine + max_power + make_category
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5352 2089387
## 2 5353 3766202 -1 -1676815 4295.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
indicies_to_exclude = unname(which(cooks.distance(bic_small_model) > 4/length(cooks.distance(bic_small_model))))
sum(cooks.distance(bic_small_model) > 4/length(cooks.distance(bic_small_model)))
## [1] 236
train_data_filtered = train_data[-indicies_to_exclude,]
n= nrow(train_data_filtered)
refitted_wip = lm(log(selling_price_in_10k) ~ .+ make_category*year+ make_category*max_power -make-seats-engine, data = train_data_filtered)
#make_category*year*max_power + transmission + owner + seller_type + km_driven_in_10k+mileage, data = train_data_filtered)
summary(refitted_wip)
##
## Call:
## lm(formula = log(selling_price_in_10k) ~ . + make_category *
## year + make_category * max_power - make - seats - engine,
## data = train_data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.37471 -0.16777 0.01752 0.19196 1.06812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.340e+02 3.332e+00 -70.227 < 2e-16 ***
## year 1.177e-01 1.660e-03 70.898 < 2e-16 ***
## km_driven_in_10k -2.605e-03 1.087e-03 -2.396 0.016595 *
## fuelDiesel 2.903e-01 4.701e-02 6.176 7.07e-10 ***
## fuelLPG 1.244e-01 7.234e-02 1.720 0.085480 .
## fuelPetrol 1.360e-02 4.699e-02 0.289 0.772240
## seller_typeIndividual -5.741e-02 1.481e-02 -3.876 0.000107 ***
## seller_typeTrustmark Dealer 6.331e-02 6.125e-02 1.034 0.301410
## transmissionManual -6.795e-02 1.876e-02 -3.621 0.000296 ***
## ownerFourth & Above Owner -1.569e-01 2.826e-02 -5.552 2.97e-08 ***
## ownerSecond Owner -7.787e-02 1.009e-02 -7.719 1.40e-14 ***
## ownerThird Owner -1.092e-01 1.729e-02 -6.317 2.90e-10 ***
## mileage -8.665e-03 1.354e-03 -6.398 1.71e-10 ***
## max_power 9.532e-03 2.705e-04 35.237 < 2e-16 ***
## make_categoryLuxury -4.440e+01 4.255e+01 -1.043 0.296825
## make_categoryMidrange 5.666e+00 4.655e+00 1.217 0.223643
## year:make_categoryLuxury 2.272e-02 2.116e-02 1.074 0.283062
## year:make_categoryMidrange -2.834e-03 2.314e-03 -1.225 0.220648
## max_power:make_categoryLuxury -5.158e-03 2.729e-03 -1.890 0.058795 .
## max_power:make_categoryMidrange 1.403e-03 3.718e-04 3.774 0.000163 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2913 on 5114 degrees of freedom
## Multiple R-squared: 0.8239, Adjusted R-squared: 0.8233
## F-statistic: 1260 on 19 and 5114 DF, p-value: < 2.2e-16
bic_refitted_wip = step(refitted_wip,direction="both",k=log(n),trace=0)
summary(bic_refitted_wip)
##
## Call:
## lm(formula = log(selling_price_in_10k) ~ year + fuel + seller_type +
## transmission + owner + mileage + max_power + make_category,
## data = train_data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38778 -0.16910 0.01772 0.19303 1.05425
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.330e+02 2.782e+00 -83.731 < 2e-16 ***
## year 1.171e-01 1.388e-03 84.420 < 2e-16 ***
## fuelDiesel 2.747e-01 4.690e-02 5.858 4.99e-09 ***
## fuelLPG 1.116e-01 7.235e-02 1.543 0.123
## fuelPetrol 1.139e-02 4.686e-02 0.243 0.808
## seller_typeIndividual -6.386e-02 1.473e-02 -4.335 1.49e-05 ***
## seller_typeTrustmark Dealer 6.101e-02 6.134e-02 0.995 0.320
## transmissionManual -7.852e-02 1.863e-02 -4.214 2.56e-05 ***
## ownerFourth & Above Owner -1.610e-01 2.828e-02 -5.694 1.31e-08 ***
## ownerSecond Owner -7.905e-02 1.007e-02 -7.853 4.93e-15 ***
## ownerThird Owner -1.109e-01 1.728e-02 -6.417 1.51e-10 ***
## mileage -7.517e-03 1.325e-03 -5.673 1.48e-08 ***
## max_power 1.010e-02 2.205e-04 45.817 < 2e-16 ***
## make_categoryLuxury 3.766e-01 6.430e-02 5.857 5.02e-09 ***
## make_categoryMidrange 7.974e-02 8.849e-03 9.011 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2918 on 5119 degrees of freedom
## Multiple R-squared: 0.8231, Adjusted R-squared: 0.8226
## F-statistic: 1701 on 14 and 5119 DF, p-value: < 2.2e-16
vif(bic_refitted_wip,type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With
## year 1.722679 1 1.312509 --
## fuel 1.358473 3 1.052386 --
## seller_type 1.087360 2 1.021159 --
## transmission 1.194923 1 1.093125 --
## owner 1.351413 3 1.051473 --
## mileage 1.635368 1 1.278815 --
## max_power 1.902386 1 1.379270 --
## make_category 1.274009 2 1.062412 --
## Other Predictors
## year fuel, seller_type, transmission, owner, mileage, max_power, make_category
## fuel year, seller_type, transmission, owner, mileage, max_power, make_category
## seller_type year, fuel, transmission, owner, mileage, max_power, make_category
## transmission year, fuel, seller_type, owner, mileage, max_power, make_category
## owner year, fuel, seller_type, transmission, mileage, max_power, make_category
## mileage year, fuel, seller_type, transmission, owner, max_power, make_category
## max_power year, fuel, seller_type, transmission, owner, mileage, make_category
## make_category year, fuel, seller_type, transmission, owner, mileage, max_power
plot(bic_refitted_wip,which=c(1,2))
bptest(bic_refitted_wip)
##
## studentized Breusch-Pagan test
##
## data: bic_refitted_wip
## BP = 574.66, df = 14, p-value < 2.2e-16